Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ALLOC: Fail Stale Primary Alloc. Req. without Data #37226

Merged

Conversation

original-brownbear
Copy link
Member

* Get indices shard store status before enqueuing the reallocation state update task to prevent
tasks that would fail because a node does not hold a stale copy of the shard on a best effort basis
* Closes elastic#37098
@original-brownbear original-brownbear added >bug v7.0.0 :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v6.7.0 labels Jan 8, 2019
@original-brownbear original-brownbear added :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Jan 8, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@@ -135,7 +134,7 @@ public void testFailedRecoveryOnAllocateStalePrimaryRequiresAnotherAllocateStale
assertThat(shardRouting.unassignedInfo().getReason(), equalTo(UnassignedInfo.Reason.ALLOCATION_FAILED));
});

try(Store store = new Store(shardId, indexSettings, new SimpleFSDirectory(indexPath), new DummyShardLock(shardId))) {
try (Store store = new Store(shardId, indexSettings, new SimpleFSDirectory(indexPath), new DummyShardLock(shardId))) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for these two noisy cleanups that snuck into this file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should back them out for the sake of future git annotate users :)

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I left a small number of small comments.

I would rather avoid the // code comments that you added - they don't really say anything that isn't already clear to me from the code, and I always worry about comments like this that fall out of sync later.

for (AllocationCommand command : request.getCommands().commands()) {
if (command instanceof AllocateStalePrimaryAllocationCommand) {
if (stalePrimaryAllocations == null) {
stalePrimaryAllocations = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can afford to make this HashMap eagerly and avoid the noise in the loop.

IllegalArgumentException.class,
() -> client().admin().cluster().prepareReroute().add(new AllocateStalePrimaryAllocationCommand("test", 0,
dataNodeWithNoShardCopy, true)).get());
assertThat(iae.getMessage(), equalTo("No data for shard [0] of index [test] found on node [" + dataNodeWithNoShardCopy + ']'));

logger.info("--> wait until shard is failed and becomes unassigned again");
assertBusy(() ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we no longer want this to be an assertBusy() - we expect that the previous reroute didn't do anything, so it should still be in this state from beforehand.

@@ -135,7 +134,7 @@ public void testFailedRecoveryOnAllocateStalePrimaryRequiresAnotherAllocateStale
assertThat(shardRouting.unassignedInfo().getReason(), equalTo(UnassignedInfo.Reason.ALLOCATION_FAILED));
});

try(Store store = new Store(shardId, indexSettings, new SimpleFSDirectory(indexPath), new DummyShardLock(shardId))) {
try (Store store = new Store(shardId, indexSettings, new SimpleFSDirectory(indexPath), new DummyShardLock(shardId))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should back them out for the sake of future git annotate users :)

@original-brownbear
Copy link
Member Author

@DaveCTurner all points addressed, thanks for taking a look!

@original-brownbear
Copy link
Member Author

Jenkins run gradle build tests 1

@original-brownbear
Copy link
Member Author

Jenkins run Gradle build tests 2

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good, I asked for a couple more tests.

final String index = entry.getKey();
final ImmutableOpenIntMap<List<IndicesShardStoresResponse.StoreStatus>> indexStatus = status.get(index);
if (indexStatus == null) {
e = ExceptionsHelper.useOrSuppress(e, new IndexNotFoundException(index));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have a test that hits this branch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually no ... my bad this one is dead code. The logic in the index shard store status request already checks the index exists.
I added an assertion for this now and a test that makes sure it's not tripped when the index doesn't exist :)

indexStatus.get(command.shardId());
if (shardStatus == null) {
e = ExceptionsHelper.useOrSuppress(e, new IllegalArgumentException(
"No data for shard [" + command.shardId() + "] of index [" + index + "] found on any node")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have a test that hits this branch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in ea98277 :)

@original-brownbear
Copy link
Member Author

@DaveCTurner thanks for taking a look => tests added and implementation simplified a little as well in ea98277 :)

@original-brownbear
Copy link
Member Author

@DaveCTurner ping :)

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@original-brownbear
Copy link
Member Author

@DaveCTurner thanks!

@original-brownbear original-brownbear merged commit 46237fa into elastic:master Jan 10, 2019
@original-brownbear original-brownbear deleted the stale-allocation-problem branch January 10, 2019 15:28
original-brownbear added a commit that referenced this pull request Jan 28, 2019
* Get indices shard store status before enqueuing the reallocation state update task to prevent
tasks that would fail because a node does not hold a stale copy of the shard on a best effort basis
* Closes #37098
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v6.7.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants